As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.
load("ames_train.Rdata")
Use the code block below to load any necessary packages
library(statsr)
library(PairedData)
library(dplyr)
library(ggplot2)
library(devtools)
library(MASS)
library(BAS)
Let’s check some relations between variables which we can make use of in the later stages.
We will create a new variable age, as we will use in many cases.
ames_train <- ames_train %>% mutate(age=2018-Year.Built)
As normally we all think what is the relation between price and age
ggplot(data=ames_train, aes(x=age)) + geom_histogram(aes(y =..density..), bins=30) + geom_density(col="red") +
geom_vline(aes(xintercept = mean(age)),col='red')+
geom_vline(aes(xintercept = median(age)),col='blue')+
geom_text(data = ames_train, aes( x = (mean(ames_train$age)), y = .015, label = 'mean', col='red'), size = 3, parse = T) +
geom_text(data = ames_train,aes( x = (median(ames_train$age)),y = .020, label = 'median'), col='blue', size = 3, parse = T)
Plot shows the multimodal distribution and the right skewness. It means that we may need to use robust statistics.
summary(ames_train$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.0 17.0 43.0 45.8 63.0 146.0
Above summary statistics show that 50% of the houses are aged between 17 and 63 years. And 50% of houses are less than 43 years old.
summary(ames_train$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12789 129762 159467 181190 213000 615000
Above summary statistics show that houses are sold at a max of $615,000. And 50% of houses are priced between 130k and 180k approximately.
Let’s see their relation ship
ggplot(data=ames_train, aes(x=log(age), y=log(price))) + geom_point() + stat_smooth(method = "lm", se = FALSE)
Taken log of price and age. They are better in finding the good relation due to skewness (as seen in Peer assignment 1). From the plot, as the age increases, the price decreases.
What kind of houses are sold a lot?
Cookbook data :
Sale Type (Nominal): Type of sale
WD - Warranty Deed - Conventional
CWD - Warranty Deed - Cash
VWD - Warranty Deed - VA Loan
New - Home just constructed and sold
COD - Court Officer Deed/Estate
Con - Contract 15% Down payment regular terms
ConLw - Contract Low Down payment and low interest
ConLI - Contract Low Interest
ConLD - Contract Low Down
Oth - Other
Sale Condition (Nominal): Condition of sale
Normal - Normal Sale
Abnorml - Abnormal Sale - trade, foreclosure, short sale
AdjLand- Adjoining Land Purchase
Alloca - Allocation - two linked properties with separate deeds, typically condo with a garage unit
Family - Sale between family members
Partial- Home was not completed when last assessed (associated with New Homes)
mosaic_table = table(ames_train$Sale.Condition, ames_train$Sale.Type)
mosaicplot(mosaic_table, "Sale Condition vs Sale Type", xlab="Sale Condition", ylab="Sale Type", color = 2:5, dir=c("v","h"), las=2, off=20)
Inferences
Sale.Condition Abnormal houses are sold in good numbers compared to others (except Normal)Sale.Condition == "Family" are also good with good Sale.TypeSale.Condition == "AdjLand" is very small compared to all other sale conditionsCorresponding summary stats
ames_train %>% filter(!is.na(Sale.Condition)) %>% group_by(Sale.Condition) %>% summarise(count=n(), mean=mean(price), median=median(price), IQR=IQR(price), min=min(price), max=max(price), sd=sd(price))
## # A tibble: 6 x 8
## Sale.Condition count mean median IQR min max sd
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abnorml 61 143740. 132000 55000 12789 475000 76042.
## 2 AdjLand 2 138750 138750 11250 127500 150000 15910.
## 3 Alloca 4 156766. 152556. 11483 142953 179000 15557.
## 4 Family 17 146959. 149000 19000 82500 235000 41490.
## 5 Normal 834 174622. 155500 76000 39300 615000 72270.
## 6 Partial 82 285172. 254146. 161610 115000 611657 107858.
Normal and Abnormal houses are sold in relatively near price compared to AdjLand,Family and AllocaPartial has much variance compared to very low variance of Alloca, Family and AdjLand (check the plot below too)ggplot(data=ames_train, aes(x=factor(Sale.Condition), y=price)) + geom_boxplot()
ames_train %>% ggplot(aes(x=price))+
geom_histogram(aes(fill=factor(ames_train$Sale.Condition)))+
labs(title="Price distribution by Sale.Condition", xlab="Price", ylab="House Count", fill="Sale.Condition", color=1:5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This plot also shows that the house price skewness is influenced partially by Sales not done in Normal condition.
Heat Map of the log(area) vs log(price) gives us where the population density is concentrated. It can see that inferenced that many homes are chosen for the area between exp(7.0) (1096) to exp(7.3) (1480) square feet with the price range around exp(12) ($162,754)
ggplot(data=ames_train, aes(x=log(area), y=log(price),color = "red")) +
geom_point() +
stat_density_2d(aes(fill = ..level..), geom="polygon") +
scale_fill_gradient(low="blue", high="red") +
stat_smooth(method = "lm", se = FALSE)
Below plot can be used to know the Overall.Qual based distribution. It is to know what kind of quality homes present in what area. Eg: lower quality homes are mostly present in the bottom left while higher quality homes are in the top right
ggplot(data=ames_train, aes(x=log(area), y=log(price), color=factor(Overall.Qual), size=factor(Overall.Qual))) +
geom_point(alpha=0.4) +
scale_color_manual(values = c("red","green","yellow","purple","blue","pink","orange","black","violet","grey"))
## Warning: Using size for a discrete variable is not advised.
ames_train %>% ggplot(aes(x=area ,y=log(price)))+ geom_point(color="red")+ geom_smooth(method = "lm", fill="blue")
ames_train %>% ggplot(aes(x=log(area) ,y=log(price)))+ geom_point(color="red")+ geom_smooth(method = "lm", fill="blue")
We'll use log(area) and log(price) for the analysis instead of just area and price to adjust to the right skewness of the price
histogram(ames_train$price)
ames_train %>% ggplot(aes(x=Bldg.Type ,y=log(price), fill=Bldg.Type))+ geom_point(color="red")+ geom_boxplot()
Fam and Twnhs has good variability
ames_train %>% ggplot(aes(x=Sale.Type ,y=log(price), fill=Sale.Type))+ geom_point(color="red")+ geom_boxplot()
Based on EDA, we’ll choose the below variables , area, Lot.Area, price, age, Bldg.Type, Neighborhood, Sale.Type, Sale.Condition, Bsmt.Qual and Bedroom.AbvGr for the initial model analysis. Some params like Bsmt.Qual, Neighborhood, Lot.Area are chosen on assumption
Let’s check for NA values first.
NA_count_cols <- c("area", "Lot.Area", "price", "age", "Bldg.Type", "Neighborhood", "Sale.Type", "Sale.Condition", "Bsmt.Qual", "Bedroom.AbvGr")
NA_count <- c(dim(ames_train[which(is.na(ames_train$area)),])[[1]], dim(ames_train[which(is.na(ames_train$Lot.Area)),])[[1]], dim(ames_train[which(is.na(ames_train$price)),])[[1]], dim(ames_train[which(is.na(ames_train$age)),])[[1]], dim(ames_train[which(is.na(ames_train$Bldg.Type)),])[[1]], dim(ames_train[which(is.na(ames_train$Neighborhood)),])[[1]], dim(ames_train[which(is.na(ames_train$Sale.Type)),])[[1]], dim(ames_train[which(is.na(ames_train$Sale.Condition)),])[[1]], dim(ames_train[which(is.na(ames_train$Bsmt.Qual) | ames_train$Bedroom.AbvGr == ""),])[[1]], dim(ames_train[which(is.na(ames_train$Bedroom.AbvGr)),])[[1]])
NA_data = data.frame(var_name = NA_count_cols, NA_count)
NA_data
## var_name NA_count
## 1 area 0
## 2 Lot.Area 0
## 3 price 0
## 4 age 0
## 5 Bldg.Type 0
## 6 Neighborhood 0
## 7 Sale.Type 0
## 8 Sale.Condition 0
## 9 Bsmt.Qual 21
## 10 Bedroom.AbvGr 0
Only Bsmt.Qual has NA values. As NA in Bsmt.Qual means there is no basement, we’ll replace NA with No and save it in a new column Bsmt.Qual.New
ames_train_bq <- ames_train %>% select(Bsmt.Qual)
ames_train_bq <- sapply(ames_train_bq, as.character)
ames_train_bq[is.na(ames_train_bq)] <- c("No")
ames_train_bq[ames_train_bq==""] <- c("No")
ames_train_bq <- data.frame(ames_train_bq)
colnames(ames_train_bq) <- "Bsmt.Qual.New"
ames_train <- cbind(ames_train, ames_train_bq)
ames_train %>% group_by(Bsmt.Qual.New) %>% summarise(count=n())
## # A tibble: 6 x 2
## Bsmt.Qual.New count
## <fct> <int>
## 1 Ex 87
## 2 Fa 28
## 3 Gd 424
## 4 No 22
## 5 Po 1
## 6 TA 438
All NA's are removed. We can proceed with model.
initial_model <- lm(log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type + Neighborhood + Sale.Type + Sale.Condition + Bsmt.Qual.New + Bedroom.AbvGr, data=ames_train)
summary(initial_model)
##
## Call:
## lm(formula = log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type +
## Neighborhood + Sale.Type + Sale.Condition + Bsmt.Qual.New +
## Bedroom.AbvGr, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.82370 -0.07893 0.00673 0.08691 0.65698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0604157 0.2189550 32.246 < 2e-16 ***
## log(area) 0.6549554 0.0267014 24.529 < 2e-16 ***
## log(Lot.Area) 0.0668661 0.0175700 3.806 0.000150 ***
## age -0.0040712 0.0004557 -8.933 < 2e-16 ***
## Bldg.Type2fmCon -0.0438328 0.0397442 -1.103 0.270363
## Bldg.TypeDuplex -0.1240790 0.0324851 -3.820 0.000142 ***
## Bldg.TypeTwnhs -0.1721567 0.0439861 -3.914 9.73e-05 ***
## Bldg.TypeTwnhsE -0.0837392 0.0282645 -2.963 0.003126 **
## NeighborhoodBlueste -0.0510795 0.1116689 -0.457 0.647475
## NeighborhoodBrDale -0.1189985 0.0825160 -1.442 0.149598
## NeighborhoodBrkSide -0.0598267 0.0668764 -0.895 0.371236
## NeighborhoodClearCr -0.0411322 0.0756287 -0.544 0.586659
## NeighborhoodCollgCr -0.0794834 0.0581222 -1.368 0.171786
## NeighborhoodCrawfor 0.0810311 0.0658794 1.230 0.219006
## NeighborhoodEdwards -0.1731047 0.0605792 -2.857 0.004364 **
## NeighborhoodGilbert -0.1687278 0.0602440 -2.801 0.005202 **
## NeighborhoodGreens 0.2473113 0.0985777 2.509 0.012281 *
## NeighborhoodGrnHill 0.4857328 0.1308637 3.712 0.000218 ***
## NeighborhoodIDOTRR -0.2965654 0.0681622 -4.351 1.50e-05 ***
## NeighborhoodMeadowV -0.2631989 0.0694221 -3.791 0.000159 ***
## NeighborhoodMitchel -0.0956257 0.0613052 -1.560 0.119135
## NeighborhoodNAmes -0.0696095 0.0596347 -1.167 0.243398
## NeighborhoodNoRidge 0.0375115 0.0640297 0.586 0.558120
## NeighborhoodNPkVill -0.0017319 0.1016813 -0.017 0.986414
## NeighborhoodNridgHt 0.1108936 0.0584668 1.897 0.058173 .
## NeighborhoodNWAmes -0.1020716 0.0619968 -1.646 0.100014
## NeighborhoodOldTown -0.1524135 0.0667131 -2.285 0.022556 *
## NeighborhoodSawyer -0.0767783 0.0618349 -1.242 0.214667
## NeighborhoodSawyerW -0.1456620 0.0601292 -2.422 0.015602 *
## NeighborhoodSomerst 0.0016004 0.0555321 0.029 0.977014
## NeighborhoodStoneBr 0.1669896 0.0647757 2.578 0.010089 *
## NeighborhoodSWISU -0.1305219 0.0784628 -1.663 0.096546 .
## NeighborhoodTimber 0.0045418 0.0679828 0.067 0.946749
## NeighborhoodVeenker -0.0299609 0.0773795 -0.387 0.698700
## Sale.TypeCon 0.2056553 0.0857099 2.399 0.016613 *
## Sale.TypeConLD 0.0104395 0.0739765 0.141 0.887806
## Sale.TypeConLI 0.0774655 0.0830963 0.932 0.351451
## Sale.TypeConLw 0.0811007 0.0768599 1.055 0.291614
## Sale.TypeCWD 0.2134994 0.1026614 2.080 0.037827 *
## Sale.TypeNew -0.0578622 0.1104776 -0.524 0.600578
## Sale.TypeOth 0.0046133 0.0917232 0.050 0.959897
## Sale.TypeVWD 0.0409980 0.1708287 0.240 0.810386
## Sale.TypeWD 0.0823896 0.0343927 2.396 0.016789 *
## Sale.ConditionAdjLand 0.2147700 0.1268012 1.694 0.090641 .
## Sale.ConditionAlloca 0.2920297 0.0880430 3.317 0.000945 ***
## Sale.ConditionFamily -0.0409188 0.0473810 -0.864 0.388020
## Sale.ConditionNormal 0.1108491 0.0235083 4.715 2.78e-06 ***
## Sale.ConditionPartial 0.2661889 0.1056136 2.520 0.011886 *
## Bsmt.Qual.NewFa -0.2413341 0.0473529 -5.096 4.18e-07 ***
## Bsmt.Qual.NewGd -0.1488620 0.0238543 -6.240 6.57e-10 ***
## Bsmt.Qual.NewNo -0.4501281 0.0478818 -9.401 < 2e-16 ***
## Bsmt.Qual.NewPo -0.2439410 0.1720251 -1.418 0.156504
## Bsmt.Qual.NewTA -0.1818053 0.0307897 -5.905 4.92e-09 ***
## Bedroom.AbvGr -0.0497919 0.0096609 -5.154 3.10e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1659 on 946 degrees of freedom
## Multiple R-squared: 0.8527, Adjusted R-squared: 0.8444
## F-statistic: 103.3 on 53 and 946 DF, p-value: < 2.2e-16
0.65495540.0668661-0.0497919Family0.00407120.8444 which is a good indicator for the initial model, with value almost near to 1Now either using BAS another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?
initial_model_bic <- bas.lm(log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type + Sale.Type + Sale.Condition + Bsmt.Qual.New + Bedroom.AbvGr + Neighborhood,data = ames_train, prior = "BIC", modelprior = uniform())
## Warning in model == got.parents: longer object length is not a multiple of
## shorter object length
## Warning in bas.lm(log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type
## + : bestmodel violates heredity conditions; resetting to null model
summary(initial_model_bic)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.000000e+00 1.0000 1.0000000 1.000000e+00
## log(area) 1.000000e+00 1.0000 1.0000000 1.000000e+00
## log(Lot.Area) 9.875247e-01 1.0000 1.0000000 0.000000e+00
## age 1.000000e+00 1.0000 1.0000000 1.000000e+00
## Bldg.Type2fmCon 9.043871e-01 1.0000 0.0000000 1.000000e+00
## Bldg.TypeDuplex 9.043871e-01 1.0000 0.0000000 1.000000e+00
## Bldg.TypeTwnhs 9.043871e-01 1.0000 0.0000000 1.000000e+00
## Bldg.TypeTwnhsE 9.043871e-01 1.0000 0.0000000 1.000000e+00
## Sale.TypeCon 6.698193e-11 0.0000 0.0000000 0.000000e+00
## Sale.TypeConLD 6.698193e-11 0.0000 0.0000000 0.000000e+00
## Sale.TypeConLI 6.698193e-11 0.0000 0.0000000 0.000000e+00
## Sale.TypeConLw 6.698193e-11 0.0000 0.0000000 0.000000e+00
## Sale.TypeCWD 6.698193e-11 0.0000 0.0000000 0.000000e+00
## Sale.TypeNew 6.698193e-11 0.0000 0.0000000 0.000000e+00
## Sale.TypeOth 6.698193e-11 0.0000 0.0000000 0.000000e+00
## Sale.TypeVWD 6.698193e-11 0.0000 0.0000000 0.000000e+00
## Sale.TypeWD 6.698193e-11 0.0000 0.0000000 0.000000e+00
## Sale.ConditionAdjLand 9.997803e-01 1.0000 1.0000000 1.000000e+00
## Sale.ConditionAlloca 9.997803e-01 1.0000 1.0000000 1.000000e+00
## Sale.ConditionFamily 9.997803e-01 1.0000 1.0000000 1.000000e+00
## Sale.ConditionNormal 9.997803e-01 1.0000 1.0000000 1.000000e+00
## Sale.ConditionPartial 9.997803e-01 1.0000 1.0000000 1.000000e+00
## Bsmt.Qual.NewFa 1.000000e+00 1.0000 1.0000000 1.000000e+00
## Bsmt.Qual.NewGd 1.000000e+00 1.0000 1.0000000 1.000000e+00
## Bsmt.Qual.NewNo 1.000000e+00 1.0000 1.0000000 1.000000e+00
## Bsmt.Qual.NewPo 1.000000e+00 1.0000 1.0000000 1.000000e+00
## Bsmt.Qual.NewTA 1.000000e+00 1.0000 1.0000000 1.000000e+00
## Bedroom.AbvGr 9.999793e-01 1.0000 1.0000000 1.000000e+00
## NeighborhoodBlueste 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodBrDale 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodBrkSide 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodClearCr 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodCollgCr 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodCrawfor 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodEdwards 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodGilbert 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodGreens 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodGrnHill 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodIDOTRR 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodMeadowV 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodMitchel 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodNAmes 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodNoRidge 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodNPkVill 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodNridgHt 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodNWAmes 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodOldTown 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodSawyer 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodSawyerW 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodSomerst 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodStoneBr 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodSWISU 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodTimber 1.000000e+00 1.0000 1.0000000 1.000000e+00
## NeighborhoodVeenker 1.000000e+00 1.0000 1.0000000 1.000000e+00
## BF NA 1.0000 0.1072144 1.397977e-02
## PostProbs NA 0.8917 0.0956000 1.250000e-02
## R2 NA 0.8504 0.8455000 8.481000e-01
## dim NA 45.0000 41.0000000 4.400000e+01
## logmarg NA -1792.7887 -1795.0215757 -1.797059e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## log(area) 1.000000e+00 1.000000e+00
## log(Lot.Area) 1.000000e+00 1.000000e+00
## age 1.000000e+00 1.000000e+00
## Bldg.Type2fmCon 1.000000e+00 1.000000e+00
## Bldg.TypeDuplex 1.000000e+00 1.000000e+00
## Bldg.TypeTwnhs 1.000000e+00 1.000000e+00
## Bldg.TypeTwnhsE 1.000000e+00 1.000000e+00
## Sale.TypeCon 0.000000e+00 0.000000e+00
## Sale.TypeConLD 0.000000e+00 0.000000e+00
## Sale.TypeConLI 0.000000e+00 0.000000e+00
## Sale.TypeConLw 0.000000e+00 0.000000e+00
## Sale.TypeCWD 0.000000e+00 0.000000e+00
## Sale.TypeNew 0.000000e+00 0.000000e+00
## Sale.TypeOth 0.000000e+00 0.000000e+00
## Sale.TypeVWD 0.000000e+00 0.000000e+00
## Sale.TypeWD 0.000000e+00 0.000000e+00
## Sale.ConditionAdjLand 0.000000e+00 1.000000e+00
## Sale.ConditionAlloca 0.000000e+00 1.000000e+00
## Sale.ConditionFamily 0.000000e+00 1.000000e+00
## Sale.ConditionNormal 0.000000e+00 1.000000e+00
## Sale.ConditionPartial 0.000000e+00 1.000000e+00
## Bsmt.Qual.NewFa 1.000000e+00 1.000000e+00
## Bsmt.Qual.NewGd 1.000000e+00 1.000000e+00
## Bsmt.Qual.NewNo 1.000000e+00 1.000000e+00
## Bsmt.Qual.NewPo 1.000000e+00 1.000000e+00
## Bsmt.Qual.NewTA 1.000000e+00 1.000000e+00
## Bedroom.AbvGr 1.000000e+00 0.000000e+00
## NeighborhoodBlueste 1.000000e+00 1.000000e+00
## NeighborhoodBrDale 1.000000e+00 1.000000e+00
## NeighborhoodBrkSide 1.000000e+00 1.000000e+00
## NeighborhoodClearCr 1.000000e+00 1.000000e+00
## NeighborhoodCollgCr 1.000000e+00 1.000000e+00
## NeighborhoodCrawfor 1.000000e+00 1.000000e+00
## NeighborhoodEdwards 1.000000e+00 1.000000e+00
## NeighborhoodGilbert 1.000000e+00 1.000000e+00
## NeighborhoodGreens 1.000000e+00 1.000000e+00
## NeighborhoodGrnHill 1.000000e+00 1.000000e+00
## NeighborhoodIDOTRR 1.000000e+00 1.000000e+00
## NeighborhoodMeadowV 1.000000e+00 1.000000e+00
## NeighborhoodMitchel 1.000000e+00 1.000000e+00
## NeighborhoodNAmes 1.000000e+00 1.000000e+00
## NeighborhoodNoRidge 1.000000e+00 1.000000e+00
## NeighborhoodNPkVill 1.000000e+00 1.000000e+00
## NeighborhoodNridgHt 1.000000e+00 1.000000e+00
## NeighborhoodNWAmes 1.000000e+00 1.000000e+00
## NeighborhoodOldTown 1.000000e+00 1.000000e+00
## NeighborhoodSawyer 1.000000e+00 1.000000e+00
## NeighborhoodSawyerW 1.000000e+00 1.000000e+00
## NeighborhoodSomerst 1.000000e+00 1.000000e+00
## NeighborhoodStoneBr 1.000000e+00 1.000000e+00
## NeighborhoodSWISU 1.000000e+00 1.000000e+00
## NeighborhoodTimber 1.000000e+00 1.000000e+00
## NeighborhoodVeenker 1.000000e+00 1.000000e+00
## BF 2.247309e-04 2.201862e-05
## PostProbs 2.000000e-04 0.000000e+00
## R2 8.425000e-01 8.461000e-01
## dim 4.000000e+01 4.400000e+01
## logmarg -1.801189e+03 -1.803512e+03
image(initial_model_bic, rotate = FALSE)
As per the summary(initial_model_bic), we can notice that inclusion probability of Sale.Type variable is 0. So we can remove that from our model.
Residuals vs Fitted model : Our model holds good as there is no much influence due to the outliers
plot(initial_model_bic, which=1)
Inclusion Probabilities Model : Indicating that Sale.Type can be removed from the model
plot(initial_model_bic, which=4)
initial_model_aic <- bas.lm(log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type + Sale.Type + Sale.Condition + Bsmt.Qual.New + Bedroom.AbvGr + Neighborhood,data = ames_train, prior = "AIC", modelprior = uniform())
## Warning in model == got.parents: longer object length is not a multiple of
## shorter object length
## Warning in bas.lm(log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type
## + : bestmodel violates heredity conditions; resetting to null model
summary(initial_model_aic)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.0000000 1.000000e+00
## log(area) 1.0000000 1.0000 1.0000000 1.000000e+00
## log(Lot.Area) 0.9987649 1.0000 1.0000000 0.000000e+00
## age 1.0000000 1.0000 1.0000000 1.000000e+00
## Bldg.Type2fmCon 0.9999948 1.0000 1.0000000 1.000000e+00
## Bldg.TypeDuplex 0.9999948 1.0000 1.0000000 1.000000e+00
## Bldg.TypeTwnhs 0.9999948 1.0000 1.0000000 1.000000e+00
## Bldg.TypeTwnhsE 0.9999948 1.0000 1.0000000 1.000000e+00
## Sale.TypeCon 0.2145579 0.0000 1.0000000 0.000000e+00
## Sale.TypeConLD 0.2145579 0.0000 1.0000000 0.000000e+00
## Sale.TypeConLI 0.2145579 0.0000 1.0000000 0.000000e+00
## Sale.TypeConLw 0.2145579 0.0000 1.0000000 0.000000e+00
## Sale.TypeCWD 0.2145579 0.0000 1.0000000 0.000000e+00
## Sale.TypeNew 0.2145579 0.0000 1.0000000 0.000000e+00
## Sale.TypeOth 0.2145579 0.0000 1.0000000 0.000000e+00
## Sale.TypeVWD 0.2145579 0.0000 1.0000000 0.000000e+00
## Sale.TypeWD 0.2145579 0.0000 1.0000000 0.000000e+00
## Sale.ConditionAdjLand 1.0000000 1.0000 1.0000000 1.000000e+00
## Sale.ConditionAlloca 1.0000000 1.0000 1.0000000 1.000000e+00
## Sale.ConditionFamily 1.0000000 1.0000 1.0000000 1.000000e+00
## Sale.ConditionNormal 1.0000000 1.0000 1.0000000 1.000000e+00
## Sale.ConditionPartial 1.0000000 1.0000 1.0000000 1.000000e+00
## Bsmt.Qual.NewFa 1.0000000 1.0000 1.0000000 1.000000e+00
## Bsmt.Qual.NewGd 1.0000000 1.0000 1.0000000 1.000000e+00
## Bsmt.Qual.NewNo 1.0000000 1.0000 1.0000000 1.000000e+00
## Bsmt.Qual.NewPo 1.0000000 1.0000 1.0000000 1.000000e+00
## Bsmt.Qual.NewTA 1.0000000 1.0000 1.0000000 1.000000e+00
## Bedroom.AbvGr 0.9999979 1.0000 1.0000000 1.000000e+00
## NeighborhoodBlueste 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodBrDale 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodBrkSide 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodClearCr 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodCollgCr 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodCrawfor 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodEdwards 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodGilbert 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodGreens 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodGrnHill 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodIDOTRR 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodMeadowV 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodMitchel 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodNAmes 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodNoRidge 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodNPkVill 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodNridgHt 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodNWAmes 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodOldTown 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodSawyer 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodSawyerW 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodSomerst 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodStoneBr 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodSWISU 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodTimber 1.0000000 1.0000 1.0000000 1.000000e+00
## NeighborhoodVeenker 1.0000000 1.0000 1.0000000 1.000000e+00
## BF NA 1.0000 0.2731246 1.201696e-03
## PostProbs NA 0.7845 0.2143000 9.000000e-04
## R2 NA 0.8504 0.8527000 8.481000e-01
## dim NA 45.0000 54.0000000 4.400000e+01
## logmarg NA -1682.3642 -1683.6619841 -1.689088e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## log(area) 1.000000e+00 1.000000e+00
## log(Lot.Area) 0.000000e+00 1.000000e+00
## age 1.000000e+00 1.000000e+00
## Bldg.Type2fmCon 1.000000e+00 0.000000e+00
## Bldg.TypeDuplex 1.000000e+00 0.000000e+00
## Bldg.TypeTwnhs 1.000000e+00 0.000000e+00
## Bldg.TypeTwnhsE 1.000000e+00 0.000000e+00
## Sale.TypeCon 1.000000e+00 0.000000e+00
## Sale.TypeConLD 1.000000e+00 0.000000e+00
## Sale.TypeConLI 1.000000e+00 0.000000e+00
## Sale.TypeConLw 1.000000e+00 0.000000e+00
## Sale.TypeCWD 1.000000e+00 0.000000e+00
## Sale.TypeNew 1.000000e+00 0.000000e+00
## Sale.TypeOth 1.000000e+00 0.000000e+00
## Sale.TypeVWD 1.000000e+00 0.000000e+00
## Sale.TypeWD 1.000000e+00 0.000000e+00
## Sale.ConditionAdjLand 1.000000e+00 1.000000e+00
## Sale.ConditionAlloca 1.000000e+00 1.000000e+00
## Sale.ConditionFamily 1.000000e+00 1.000000e+00
## Sale.ConditionNormal 1.000000e+00 1.000000e+00
## Sale.ConditionPartial 1.000000e+00 1.000000e+00
## Bsmt.Qual.NewFa 1.000000e+00 1.000000e+00
## Bsmt.Qual.NewGd 1.000000e+00 1.000000e+00
## Bsmt.Qual.NewNo 1.000000e+00 1.000000e+00
## Bsmt.Qual.NewPo 1.000000e+00 1.000000e+00
## Bsmt.Qual.NewTA 1.000000e+00 1.000000e+00
## Bedroom.AbvGr 1.000000e+00 1.000000e+00
## NeighborhoodBlueste 1.000000e+00 1.000000e+00
## NeighborhoodBrDale 1.000000e+00 1.000000e+00
## NeighborhoodBrkSide 1.000000e+00 1.000000e+00
## NeighborhoodClearCr 1.000000e+00 1.000000e+00
## NeighborhoodCollgCr 1.000000e+00 1.000000e+00
## NeighborhoodCrawfor 1.000000e+00 1.000000e+00
## NeighborhoodEdwards 1.000000e+00 1.000000e+00
## NeighborhoodGilbert 1.000000e+00 1.000000e+00
## NeighborhoodGreens 1.000000e+00 1.000000e+00
## NeighborhoodGrnHill 1.000000e+00 1.000000e+00
## NeighborhoodIDOTRR 1.000000e+00 1.000000e+00
## NeighborhoodMeadowV 1.000000e+00 1.000000e+00
## NeighborhoodMitchel 1.000000e+00 1.000000e+00
## NeighborhoodNAmes 1.000000e+00 1.000000e+00
## NeighborhoodNoRidge 1.000000e+00 1.000000e+00
## NeighborhoodNPkVill 1.000000e+00 1.000000e+00
## NeighborhoodNridgHt 1.000000e+00 1.000000e+00
## NeighborhoodNWAmes 1.000000e+00 1.000000e+00
## NeighborhoodOldTown 1.000000e+00 1.000000e+00
## NeighborhoodSawyer 1.000000e+00 1.000000e+00
## NeighborhoodSawyerW 1.000000e+00 1.000000e+00
## NeighborhoodSomerst 1.000000e+00 1.000000e+00
## NeighborhoodStoneBr 1.000000e+00 1.000000e+00
## NeighborhoodSWISU 1.000000e+00 1.000000e+00
## NeighborhoodTimber 1.000000e+00 1.000000e+00
## NeighborhoodVeenker 1.000000e+00 1.000000e+00
## BF 3.726636e-04 5.853706e-06
## PostProbs 3.000000e-04 0.000000e+00
## R2 8.504000e-01 8.455000e-01
## dim 5.300000e+01 4.100000e+01
## logmarg -1.690259e+03 -1.694413e+03
image(initial_model_aic, rotate = FALSE)
As per the summary(initial_model_aic), we can notice that inclusion probability of Sale.Type variable is 0 in Model 1. So we can remove that from our model.
Residuals vs Fitted model : Our model holds good as there is no much influence due to the outliers
plot(initial_model_aic, which=1)
Inclusion Probabilities Model : Indicating that Sale.Type can be removed from the model because of the lower inclusion probability 0.2145579
plot(initial_model_aic, which=4)
Both the model’s output are almost same (with same R2) and the result is to remove the variable Sale.Type. But which is better? As we know that AIC will give penalty for the parameters far less than that of BIC for which the penalty is the number of observations. BIC is parsimonious.
From the above 2 summaries the inclusion probability of the variables
| variable name (with model) | inclusion probability |
|---|---|
| log(Lot.Area) - BIC | 9.875247e-01 |
| log(Lot.Area) - AIC | 0.9987649 |
| Bldg.Type - BIC | 6.698193e-11 |
| Bldg.Type - AIC | 0.2145579 |
We can see that BIC has put more penalty for the variable Bldg.Type. So we’ll follow BIC model
So our new preferred model is
initial_preferred_model <- lm(log(price) ~ log(area) + log(Lot.Area) + age + Bldg.Type + Neighborhood + Sale.Condition + Bsmt.Qual.New + Bedroom.AbvGr, data=ames_train)
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
Distribution is normal around 0
hist(initial_preferred_model$residuals, breaks=100)
Residuals are scattered around 0
plot(initial_preferred_model$residuals, main="Residual distribution of the latest model from BIC")
Sorting based on the residuals, we’ll get the outliers as
ames_train[names(head(sort(initial_preferred_model$residuals), 10)),c("price","age", "area", "Lot.Area", "Bldg.Type", "Neighborhood", "Sale.Condition", "Bsmt.Qual.New", "Bedroom.AbvGr")]
## price age area Lot.Area Bldg.Type Neighborhood Sale.Condition
## 428 12789 95 832 9656 1Fam OldTown Abnorml
## 310 184750 11 4676 40094 1Fam Edwards Partial
## 741 40000 98 1317 8500 1Fam IDOTRR Normal
## 268 55000 98 1086 11340 1Fam OldTown Normal
## 206 50000 108 1484 10320 1Fam IDOTRR Abnorml
## 276 150000 41 2944 24572 1Fam Veenker Family
## 559 34900 98 720 7879 1Fam IDOTRR Abnorml
## 181 82500 41 1411 11900 1Fam NWAmes Family
## 511 63000 84 1112 9780 1Fam Edwards Normal
## 645 35311 69 480 9000 1Fam IDOTRR Abnorml
## Bsmt.Qual.New Bedroom.AbvGr
## 428 Fa 2
## 310 Ex 3
## 741 TA 3
## 268 Fa 2
## 206 TA 3
## 276 Gd 3
## 559 TA 2
## 181 TA 3
## 511 TA 4
## 645 TA 1
Bldg.Type values are 1Fam1NormalTAames_train[names(head(sort(initial_preferred_model$residuals, decreasing = TRUE), 10)),c("price","age", "area", "Lot.Area", "Bldg.Type", "Neighborhood", "Sale.Condition", "Bsmt.Qual.New", "Bedroom.AbvGr")]
## price age area Lot.Area Bldg.Type Neighborhood Sale.Condition
## 125 475000 10 2276 11778 1Fam CollgCr Abnorml
## 389 415000 83 3672 19800 1Fam Edwards Normal
## 24 265979 138 2640 11700 1Fam OldTown Normal
## 274 155891 64 992 8842 1Fam Edwards Abnorml
## 607 377500 12 1746 14892 1Fam Gilbert Normal
## 806 231000 93 1512 7609 1Fam Crawfor Normal
## 272 139400 88 1108 7449 1Fam IDOTRR Normal
## 949 120500 78 778 6900 1Fam IDOTRR Normal
## 576 126000 92 919 7000 1Fam IDOTRR Normal
## 486 328900 19 1709 5119 TwnhsE Somerst Abnorml
## Bsmt.Qual.New Bedroom.AbvGr
## 125 Gd 3
## 389 TA 5
## 24 TA 4
## 274 Fa 3
## 607 Ex 3
## 806 Fa 3
## 272 TA 3
## 949 TA 2
## 576 TA 2
## 486 Ex 2
Normal conditionBldg.Type values are 1Fam1TAThe major difference between the observed trend and outlier is the Sale.Condition. Also it could be of different params, but it is the conclusion as per this model
You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).
Since the price is in log, we’ll use exp to get the original dollar values. And we use predict function for in-sample prediction. So the RMSE will be in dollars
initial_predicted_values <- exp(predict(initial_preferred_model))
initial_residues <- ames_train$price - initial_predicted_values
initial_rmse <- sqrt(mean(initial_residues^2))
initial_rmse
## [1] 31033.15
The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called ???overfitting.??? To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.
load("ames_test.Rdata")
Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
Checking for NA’s and replacing them with ‘No’
ames_test <- ames_test %>% mutate(age=2018-Year.Built)
c(dim(ames_test[which(is.na(ames_test$area)),])[[1]], dim(ames_test[which(is.na(ames_test$Lot.Area)),])[[1]], dim(ames_test[which(is.na(ames_test$price)),])[[1]], dim(ames_test[which(is.na(ames_test$age)),])[[1]], dim(ames_test[which(is.na(ames_test$Bldg.Type)),])[[1]], dim(ames_test[which(is.na(ames_test$Neighborhood)),])[[1]], dim(ames_test[which(is.na(ames_test$Sale.Type)),])[[1]], dim(ames_test[which(is.na(ames_test$Sale.Condition)),])[[1]], dim(ames_test[which(is.na(ames_test$Bsmt.Qual)),])[[1]]+dim(ames_test[which(ames_test$Bedroom.Qual == ""),])[[1]], dim(ames_test[which(is.na(ames_train$Bedroom.AbvGr)),])[[1]])
## Warning: Unknown or uninitialised column: 'Bedroom.Qual'.
## [1] 0 0 0 0 0 0 0 0 22 0
ames_test_bq <- ames_test %>% select(Bsmt.Qual)
ames_test_bq <- sapply(ames_test_bq, as.character)
ames_test_bq[is.na(ames_test_bq)] <- c("No")
ames_test_bq[ames_test_bq==""] <- c("No")
ames_test_bq <- data.frame(ames_test_bq)
colnames(ames_test_bq) <- "Bsmt.Qual.New"
ames_test <- cbind(ames_test, ames_test_bq)
model_var_level <- names(initial_preferred_model$xlevels)
initial_preferred_model$xlevels <- sapply(model_var_level, function(x) union(initial_preferred_model$xlevels[[x]], levels(ames_test[[x]])))
predict_test_data <- exp(predict(initial_preferred_model, newdata=ames_test))
## Warning in predict.lm(initial_preferred_model, newdata = ames_test):
## prediction from a rank-deficient fit may be misleading
initial_test_residue <- ames_test$price - predict_test_data
initial_test_rmse <- sqrt(mean(initial_test_residue^2, na.rm = TRUE))
initial_test_rmse
## [1] 48191.67
The RMSE difference between 48191.67 (for testing data) and 31033.15 (for training data) is very huge. This is due the potential outliers and the model has to be tuned to reduce the difference in the upcoming stages.
Below is the residual comparison of the 2 data sets. We’ll have a variable ‘type’ in the new data frame initial_residual_comparison which will indicate from where the data came from (whether it is training set or testing set). Value 1 is for training set and 2 for the testing set.
residual_comp1 <- data.frame(index<- seq(from=1, to=dim(ames_train)[1], by=1), residue <- initial_residues, type <- rep(1,dim(ames_train)[1]))
residual_comp2 <- data.frame(index<- seq(from=1, to=dim(ames_test)[1], by=1), residue <- initial_test_residue, type <- rep(2,dim(ames_test)[1]))
colnames(residual_comp1) <- c("index", "residue", "type")
colnames(residual_comp2) <- c("index", "residue", "type")
initial_residual_comparison <- rbind(residual_comp1, residual_comp2)
ggplot(initial_residual_comparison, aes(x=index, y=residue)) + geom_point(aes(color=type, alpha=0.5))
We can observe that the testing set’s residue are more dispersed than that of training set.
Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Carefully document the process that you used to come up with your final model, so that you can answer the questions below.
#Function to replace NAs
ames_transform <- function(data){
var_to_use <- data %>% summarise_all(funs(length(unique(.))))
var_to_use <- names(data.frame(var_to_use)[,var_to_use>1])
data <- data %>% select(var_to_use)
data_types <- split(names(data), sapply(data, function(x) paste(class(x), collapse = " ")))
data_int <- data %>% select(data_types$integer)
ames_train_fac <- data %>% select(data_types$factor)
ames_train_fac <- sapply(ames_train_fac, as.character)
ames_train_fac[is.na(ames_train_fac)] <- c("None")
ames_train_fac[ames_train_fac==""] <- c("None")
ames_train_fac <- data.frame(ames_train_fac)
data <- cbind(data_int, ames_train_fac)
data <- data %>% mutate(age= 2018 - Year.Built)
data$Year.Built <- NULL
return(data)
}
Provide the summary table for your model.
ames_train <- ames_transform(ames_train)
final_lm <- lm(formula = log(price)~., data = na.omit(ames_train))
n <- nrow(na.omit(ames_train))
final_BIC <- stepAIC(final_lm, direction = "backward", k=n, trace = 0, steps = 1000)
summary(final_BIC)
##
## Call:
## lm(formula = log(price) ~ Overall.Qual, data = na.omit(ames_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.53757 -0.11989 0.01186 0.13492 0.94285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.494269 0.037090 282.94 <2e-16 ***
## Overall.Qual 0.249820 0.005859 42.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2338 on 779 degrees of freedom
## Multiple R-squared: 0.7001, Adjusted R-squared: 0.6997
## F-statistic: 1818 on 1 and 779 DF, p-value: < 2.2e-16
final_AIC <- stepAIC(final_lm, direction = "backward", k=2, trace = 0, steps = 1000)
summary(final_AIC)
##
## Call:
## lm(formula = log(price) ~ area + Lot.Frontage + Overall.Qual +
## Overall.Cond + Year.Remod.Add + BsmtFin.SF.1 + BsmtFin.SF.2 +
## X1st.Flr.SF + Bsmt.Full.Bath + Kitchen.AbvGr + Fireplaces +
## Garage.Cars + Enclosed.Porch + X3Ssn.Porch + Screen.Porch +
## MS.Zoning + Lot.Shape + Lot.Config + Land.Slope + Neighborhood +
## Condition.1 + Condition.2 + Bldg.Type + Exterior.2nd + Exter.Qual +
## Exter.Cond + Foundation + Bsmt.Qual + Bsmt.Exposure + Heating +
## Central.Air + Kitchen.Qual + Functional + Garage.Type + Garage.Qual +
## Garage.Cond + Sale.Type + Sale.Condition + age, data = na.omit(ames_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.81141 -0.05332 0.00198 0.04965 0.66353
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.932e+00 7.275e-01 10.903 < 2e-16 ***
## area 2.903e-04 1.553e-05 18.688 < 2e-16 ***
## Lot.Frontage -5.980e-04 2.732e-04 -2.189 0.028980 *
## Overall.Qual 4.060e-02 6.312e-03 6.431 2.47e-10 ***
## Overall.Cond 4.451e-02 5.639e-03 7.892 1.28e-14 ***
## Year.Remod.Add 8.696e-04 3.512e-04 2.476 0.013529 *
## BsmtFin.SF.1 8.113e-05 1.497e-05 5.420 8.45e-08 ***
## BsmtFin.SF.2 4.458e-05 2.926e-05 1.523 0.128129
## X1st.Flr.SF 7.318e-05 1.895e-05 3.863 0.000124 ***
## Bsmt.Full.Bath 2.586e-02 1.166e-02 2.217 0.026979 *
## Kitchen.AbvGr 1.304e-01 4.717e-02 2.765 0.005860 **
## Fireplaces 1.656e-02 8.909e-03 1.859 0.063511 .
## Garage.Cars 3.261e-02 1.036e-02 3.147 0.001728 **
## Enclosed.Porch 1.414e-04 8.342e-05 1.694 0.090656 .
## X3Ssn.Porch 2.846e-04 1.285e-04 2.214 0.027152 *
## Screen.Porch 1.704e-04 7.717e-05 2.209 0.027553 *
## MS.ZoningFV 3.616e-01 6.987e-02 5.175 3.05e-07 ***
## MS.ZoningRH 4.993e-01 7.965e-02 6.269 6.68e-10 ***
## MS.ZoningRL 4.566e-01 6.044e-02 7.553 1.46e-13 ***
## MS.ZoningRM 3.617e-01 5.548e-02 6.520 1.42e-10 ***
## Lot.ShapeIR2 2.876e-02 2.793e-02 1.030 0.303529
## Lot.ShapeIR3 3.369e-01 7.783e-02 4.329 1.73e-05 ***
## Lot.ShapeReg -3.381e-03 1.068e-02 -0.317 0.751656
## Lot.ConfigCulDSac -3.945e-02 2.542e-02 -1.552 0.121121
## Lot.ConfigFR2 -5.461e-02 2.646e-02 -2.064 0.039412 *
## Lot.ConfigFR3 -1.396e-01 6.661e-02 -2.095 0.036536 *
## Lot.ConfigInside -3.436e-02 1.183e-02 -2.904 0.003815 **
## Land.SlopeMod -1.417e-02 2.464e-02 -0.575 0.565489
## Land.SlopeSev -2.065e-01 9.578e-02 -2.156 0.031482 *
## NeighborhoodBlueste 6.961e-02 8.468e-02 0.822 0.411377
## NeighborhoodBrDale 7.097e-02 7.033e-02 1.009 0.313254
## NeighborhoodBrkSide 1.326e-01 6.125e-02 2.165 0.030721 *
## NeighborhoodClearCr 1.007e-01 7.317e-02 1.377 0.169015
## NeighborhoodCollgCr -5.692e-03 4.752e-02 -0.120 0.904687
## NeighborhoodCrawfor 1.815e-01 5.721e-02 3.172 0.001583 **
## NeighborhoodEdwards -4.248e-03 5.157e-02 -0.082 0.934381
## NeighborhoodGilbert 5.458e-03 4.960e-02 0.110 0.912414
## NeighborhoodGreens 9.118e-02 8.161e-02 1.117 0.264313
## NeighborhoodIDOTRR 7.521e-02 6.823e-02 1.102 0.270766
## NeighborhoodMeadowV -4.861e-02 7.184e-02 -0.677 0.498854
## NeighborhoodMitchel 3.263e-02 5.206e-02 0.627 0.530971
## NeighborhoodNAmes 2.700e-03 5.119e-02 0.053 0.957945
## NeighborhoodNPkVill 1.007e-01 1.185e-01 0.849 0.395919
## NeighborhoodNWAmes -2.596e-02 5.336e-02 -0.487 0.626707
## NeighborhoodNoRidge 6.638e-02 5.465e-02 1.215 0.224984
## NeighborhoodNridgHt 1.170e-01 4.685e-02 2.496 0.012795 *
## NeighborhoodOldTown 1.619e-02 6.243e-02 0.259 0.795415
## NeighborhoodSWISU 3.746e-02 6.515e-02 0.575 0.565494
## NeighborhoodSawyer 7.070e-02 5.316e-02 1.330 0.183993
## NeighborhoodSawyerW -1.085e-02 4.948e-02 -0.219 0.826487
## NeighborhoodSomerst 1.796e-01 5.343e-02 3.361 0.000823 ***
## NeighborhoodStoneBr 1.730e-01 5.168e-02 3.348 0.000862 ***
## NeighborhoodTimber 5.207e-02 5.404e-02 0.964 0.335627
## NeighborhoodVeenker 1.007e-01 6.162e-02 1.634 0.102739
## Condition.1Feedr -3.234e-02 3.992e-02 -0.810 0.418148
## Condition.1Norm 5.170e-02 3.352e-02 1.543 0.123437
## Condition.1PosA 9.648e-02 5.832e-02 1.654 0.098573 .
## Condition.1PosN 1.059e-01 5.822e-02 1.819 0.069392 .
## Condition.1RRAe 1.244e-02 5.371e-02 0.232 0.816940
## Condition.1RRAn -2.675e-02 5.184e-02 -0.516 0.606011
## Condition.2Feedr 2.724e-01 1.478e-01 1.843 0.065856 .
## Condition.2Norm 3.870e-01 1.395e-01 2.774 0.005703 **
## Condition.2PosA 3.196e-01 1.868e-01 1.711 0.087596 .
## Condition.2PosN -5.597e-01 1.745e-01 -3.207 0.001407 **
## Condition.2RRNn 3.974e-01 1.830e-01 2.172 0.030241 *
## Bldg.Type2fmCon -1.569e-02 4.244e-02 -0.370 0.711766
## Bldg.TypeDuplex -1.703e-01 4.793e-02 -3.553 0.000409 ***
## Bldg.TypeTwnhs -1.589e-01 3.275e-02 -4.852 1.54e-06 ***
## Bldg.TypeTwnhsE -9.328e-02 2.261e-02 -4.126 4.18e-05 ***
## Exterior.2ndBrk Cmn 2.037e-01 1.382e-01 1.474 0.141029
## Exterior.2ndBrkFace 3.468e-01 5.114e-02 6.781 2.71e-11 ***
## Exterior.2ndCmentBd 2.493e-01 5.390e-02 4.625 4.53e-06 ***
## Exterior.2ndHdBoard 2.538e-01 4.651e-02 5.458 6.87e-08 ***
## Exterior.2ndImStucc 2.195e-01 6.187e-02 3.548 0.000416 ***
## Exterior.2ndMetalSd 3.001e-01 4.487e-02 6.687 4.95e-11 ***
## Exterior.2ndPlywood 2.429e-01 4.804e-02 5.058 5.55e-07 ***
## Exterior.2ndStucco 2.968e-01 5.731e-02 5.178 3.00e-07 ***
## Exterior.2ndVinylSd 2.798e-01 4.586e-02 6.102 1.81e-09 ***
## Exterior.2ndWd Sdng 3.021e-01 4.597e-02 6.572 1.03e-10 ***
## Exterior.2ndWd Shng 2.706e-01 4.989e-02 5.424 8.26e-08 ***
## Exter.QualFa 1.002e-01 5.848e-02 1.713 0.087264 .
## Exter.QualGd -3.819e-02 2.893e-02 -1.320 0.187315
## Exter.QualTA -6.860e-02 3.345e-02 -2.051 0.040715 *
## Exter.CondFa -1.436e-01 7.869e-02 -1.825 0.068393 .
## Exter.CondGd 3.875e-04 6.688e-02 0.006 0.995379
## Exter.CondTA 2.412e-02 6.584e-02 0.366 0.714229
## FoundationCBlock 1.469e-02 2.012e-02 0.730 0.465493
## FoundationPConc 5.204e-02 2.168e-02 2.400 0.016679 *
## FoundationSlab 4.752e-02 7.795e-02 0.610 0.542357
## FoundationStone 2.274e-02 6.921e-02 0.328 0.742646
## Bsmt.QualFa -1.756e-01 3.913e-02 -4.487 8.54e-06 ***
## Bsmt.QualGd -5.573e-02 2.061e-02 -2.704 0.007041 **
## Bsmt.QualNone -1.492e-01 1.261e-01 -1.184 0.237041
## Bsmt.QualPo 9.760e-01 1.844e-01 5.293 1.65e-07 ***
## Bsmt.QualTA -8.146e-02 2.665e-02 -3.057 0.002330 **
## Bsmt.ExposureGd 5.316e-02 1.922e-02 2.766 0.005831 **
## Bsmt.ExposureMn -3.019e-02 1.842e-02 -1.639 0.101605
## Bsmt.ExposureNo -2.574e-02 1.284e-02 -2.004 0.045448 *
## Bsmt.ExposureNone -7.072e-02 1.083e-01 -0.653 0.513844
## HeatingGasW 1.701e-01 5.564e-02 3.056 0.002334 **
## HeatingGrav 3.124e-01 1.576e-01 1.983 0.047818 *
## HeatingWall 1.269e-01 1.252e-01 1.014 0.311036
## Central.AirY 1.692e-01 2.992e-02 5.656 2.34e-08 ***
## Kitchen.QualFa -9.575e-02 4.358e-02 -2.197 0.028382 *
## Kitchen.QualGd -3.295e-02 2.350e-02 -1.402 0.161397
## Kitchen.QualPo 2.561e-01 1.451e-01 1.764 0.078136 .
## Kitchen.QualTA -5.048e-02 2.663e-02 -1.896 0.058441 .
## FunctionalMaj2 2.617e-02 1.247e-01 0.210 0.833866
## FunctionalMin1 4.461e-02 8.103e-02 0.550 0.582197
## FunctionalMin2 8.289e-02 7.833e-02 1.058 0.290368
## FunctionalMod 5.267e-02 8.112e-02 0.649 0.516385
## FunctionalSal -2.650e-01 1.558e-01 -1.701 0.089496 .
## FunctionalTyp 1.222e-01 7.334e-02 1.666 0.096249 .
## Garage.TypeAttchd 3.516e-02 4.443e-02 0.791 0.429055
## Garage.TypeBasment -2.603e-02 6.703e-02 -0.388 0.697870
## Garage.TypeBuiltIn -3.429e-02 4.833e-02 -0.710 0.478217
## Garage.TypeDetchd 1.337e-02 4.555e-02 0.293 0.769302
## Garage.QualFa -1.400e-01 1.149e-01 -1.219 0.223333
## Garage.QualGd -9.882e-02 1.274e-01 -0.776 0.438054
## Garage.QualPo -7.601e-01 1.756e-01 -4.328 1.75e-05 ***
## Garage.QualTA -1.485e-01 1.126e-01 -1.318 0.187845
## Garage.CondFa -1.207e-01 3.542e-02 -3.407 0.000698 ***
## Garage.CondGd 8.017e-02 7.719e-02 1.039 0.299404
## Garage.CondPo 2.909e-01 6.710e-02 4.336 1.69e-05 ***
## Garage.CondTA NA NA NA NA
## Sale.TypeCWD 6.867e-02 7.013e-02 0.979 0.327828
## Sale.TypeCon 3.059e-02 5.939e-02 0.515 0.606642
## Sale.TypeConLD 1.235e-01 6.102e-02 2.024 0.043427 *
## Sale.TypeConLI -7.701e-02 7.441e-02 -1.035 0.301046
## Sale.TypeConLw 1.622e-02 5.447e-02 0.298 0.766008
## Sale.TypeNew -6.809e-02 7.743e-02 -0.879 0.379531
## Sale.TypeOth 1.952e-01 8.477e-02 2.303 0.021583 *
## Sale.TypeVWD -2.004e-02 1.181e-01 -0.170 0.865330
## Sale.TypeWD -4.421e-03 2.731e-02 -0.162 0.871457
## Sale.ConditionAlloca 1.446e-01 8.545e-02 1.692 0.091059 .
## Sale.ConditionFamily 5.949e-03 3.708e-02 0.160 0.872575
## Sale.ConditionNormal 7.821e-02 1.881e-02 4.157 3.66e-05 ***
## Sale.ConditionPartial 1.608e-01 7.317e-02 2.198 0.028320 *
## age -2.179e-03 4.728e-04 -4.608 4.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1064 on 643 degrees of freedom
## Multiple R-squared: 0.9487, Adjusted R-squared: 0.9378
## F-statistic: 86.81 on 137 and 643 DF, p-value: < 2.2e-16
Since BIC returns Overall.Qual as the significant predictor, we’ll ignore BIC model and use the AIC model instead.
And variables like below can be removed as the p-value with significance level of 5%
Forward Selection of variables using AIC is done for the selection of better variables with the below code
## R CODE to do the forward selection
variables_chosen <- c()
var_count = 0
variables1 <- names(ames_train)
variables2 <- names(ames_train)
latest_AIC = 1/0
for(vartest in variables1){
include_var = ""
found = FALSE
for(var in variables2){
if(var == "price" || var %in% variables_chosen){
next
}
if(var == "area"){
var = "log(area)"
}
temp_variables_chosen <- variables_chosen
temp_variables_chosen[[length(temp_variables_chosen)+1]] = var
tempAIC = AIC(lm(as.formula(paste("log(price) ~ ", paste(c(temp_variables_chosen), collapse="+"))), data=ames_train ))
if(tempAIC < latest_AIC){
latest_AIC = tempAIC
include_var = var;
found = TRUE
}
}
if(found){
var_count = var_count + 1
variables_chosen[[var_count]] = include_var
}
}
print(variables_chosen)
print(latest_AIC)
> print(variables_chosen)
[1] "Overall.Qual" "log(area)" "Neighborhood" "BsmtFin.SF.1" "Overall.Cond" "age" "Bldg.Type" "Total.Bsmt.SF"
[9] "Sale.Condition" "Condition.2" "Exter.Cond" "Bsmt.Exposure" "MS.Zoning" "Garage.Cars" "Exterior.1st" "Exter.Qual"
[17] "Condition.1" "Bsmt.Full.Bath" "Central.Air" "Lot.Shape" "Garage.Cond" "Bsmt.Qual" "Garage.Qual" "Kitchen.Qual"
[25] "Garage.Yr.Blt" "Fireplaces" "Lot.Config" "Heating" "Heating.QC" "Sale.Type" "Kitchen.AbvGr" "Functional"
[33] "Year.Remod.Add" "Screen.Porch" "Wood.Deck.SF" "Foundation" "X3Ssn.Porch" "Bsmt.Cond" "Bedroom.AbvGr" "Paved.Drive"
[41] "Open.Porch.SF"
> print(latest_AIC)
[1] -1485.826
Even then, when we use it for fitting the test data it is not enough. So with the knowledge obtained from the summary based on p-value from summary(final_AIC) and the above forward selection, we narrowed it to the below model after many iterations.
aic_final_model <- stepAIC(lm(formula = log(price) ~ log(area) + Overall.Qual + X1st.Flr.SF + Kitchen.AbvGr + Garage.Cars + Enclosed.Porch + Screen.Porch + MS.Zoning + Land.Slope + Neighborhood + Bldg.Type + Exterior.2nd + Bsmt.Qual + Central.Air + Sale.Condition + age, data = na.omit(ames_train)), direction = "backward", k=2, trace = 0, steps = 1000)
summary(aic_final_model)
##
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + X1st.Flr.SF +
## Kitchen.AbvGr + Garage.Cars + Enclosed.Porch + Screen.Porch +
## MS.Zoning + Neighborhood + Bldg.Type + Exterior.2nd + Bsmt.Qual +
## Central.Air + Sale.Condition + age, data = na.omit(ames_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19931 -0.06467 -0.00153 0.07681 0.60390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.018e+00 1.954e-01 41.035 < 2e-16 ***
## log(area) 3.581e-01 2.581e-02 13.877 < 2e-16 ***
## Overall.Qual 6.522e-02 7.468e-03 8.733 < 2e-16 ***
## X1st.Flr.SF 1.350e-04 2.058e-05 6.559 1.04e-10 ***
## Kitchen.AbvGr 8.835e-02 5.786e-02 1.527 0.127204
## Garage.Cars 4.299e-02 1.234e-02 3.484 0.000524 ***
## Enclosed.Porch 2.745e-04 1.034e-04 2.654 0.008141 **
## Screen.Porch 3.121e-04 9.708e-05 3.214 0.001366 **
## MS.ZoningFV 2.511e-01 8.501e-02 2.954 0.003238 **
## MS.ZoningRH 2.246e-01 9.905e-02 2.268 0.023631 *
## MS.ZoningRL 2.660e-01 7.325e-02 3.632 0.000302 ***
## MS.ZoningRM 1.947e-01 6.743e-02 2.887 0.004001 **
## NeighborhoodBlueste 8.588e-02 1.086e-01 0.791 0.429272
## NeighborhoodBrDale 7.253e-02 8.923e-02 0.813 0.416611
## NeighborhoodBrkSide 1.105e-01 7.559e-02 1.462 0.144207
## NeighborhoodClearCr 1.523e-01 8.993e-02 1.693 0.090863 .
## NeighborhoodCollgCr 4.454e-02 6.156e-02 0.724 0.469558
## NeighborhoodCrawfor 2.423e-01 7.139e-02 3.395 0.000725 ***
## NeighborhoodEdwards 2.050e-02 6.480e-02 0.316 0.751883
## NeighborhoodGilbert 2.608e-02 6.356e-02 0.410 0.681688
## NeighborhoodGreens 2.453e-01 1.026e-01 2.392 0.016995 *
## NeighborhoodIDOTRR 1.559e-02 8.320e-02 0.187 0.851395
## NeighborhoodMeadowV -3.530e-02 9.255e-02 -0.381 0.703044
## NeighborhoodMitchel 8.910e-02 6.597e-02 1.351 0.177276
## NeighborhoodNAmes 5.776e-02 6.432e-02 0.898 0.369470
## NeighborhoodNPkVill 7.785e-02 1.570e-01 0.496 0.620156
## NeighborhoodNWAmes 1.324e-02 6.748e-02 0.196 0.844460
## NeighborhoodNoRidge 1.672e-01 6.951e-02 2.405 0.016404 *
## NeighborhoodNridgHt 1.884e-01 6.027e-02 3.126 0.001844 **
## NeighborhoodOldTown 7.102e-02 7.753e-02 0.916 0.360008
## NeighborhoodSWISU 3.565e-02 8.204e-02 0.435 0.663973
## NeighborhoodSawyer 8.171e-02 6.727e-02 1.215 0.224849
## NeighborhoodSawyerW 2.703e-02 6.366e-02 0.425 0.671259
## NeighborhoodSomerst 1.218e-01 6.812e-02 1.788 0.074248 .
## NeighborhoodStoneBr 2.382e-01 6.626e-02 3.595 0.000347 ***
## NeighborhoodTimber 1.280e-01 6.966e-02 1.837 0.066595 .
## NeighborhoodVeenker 1.860e-01 7.797e-02 2.385 0.017333 *
## Bldg.Type2fmCon -1.973e-03 5.174e-02 -0.038 0.969597
## Bldg.TypeDuplex -1.791e-01 6.008e-02 -2.981 0.002968 **
## Bldg.TypeTwnhs -1.732e-01 3.854e-02 -4.494 8.14e-06 ***
## Bldg.TypeTwnhsE -1.019e-01 2.550e-02 -3.997 7.09e-05 ***
## Exterior.2ndBrk Cmn 2.602e-01 1.812e-01 1.436 0.151531
## Exterior.2ndBrkFace 3.553e-01 6.463e-02 5.497 5.37e-08 ***
## Exterior.2ndCmentBd 3.020e-01 6.674e-02 4.526 7.05e-06 ***
## Exterior.2ndHdBoard 2.788e-01 5.794e-02 4.811 1.83e-06 ***
## Exterior.2ndImStucc 2.708e-01 7.892e-02 3.432 0.000634 ***
## Exterior.2ndMetalSd 3.243e-01 5.583e-02 5.809 9.46e-09 ***
## Exterior.2ndPlywood 2.471e-01 5.962e-02 4.144 3.82e-05 ***
## Exterior.2ndStucco 4.174e-01 7.059e-02 5.913 5.19e-09 ***
## Exterior.2ndVinylSd 2.996e-01 5.688e-02 5.267 1.84e-07 ***
## Exterior.2ndWd Sdng 3.110e-01 5.752e-02 5.407 8.75e-08 ***
## Exterior.2ndWd Shng 3.173e-01 6.212e-02 5.108 4.18e-07 ***
## Bsmt.QualFa -2.077e-01 4.643e-02 -4.473 8.96e-06 ***
## Bsmt.QualGd -9.612e-02 2.366e-02 -4.062 5.41e-05 ***
## Bsmt.QualNone -2.993e-01 5.225e-02 -5.729 1.49e-08 ***
## Bsmt.QualPo 2.508e-01 1.618e-01 1.550 0.121482
## Bsmt.QualTA -1.466e-01 3.158e-02 -4.644 4.07e-06 ***
## Central.AirY 2.411e-01 2.998e-02 8.041 3.66e-15 ***
## Sale.ConditionAlloca 1.238e-01 1.096e-01 1.130 0.258966
## Sale.ConditionFamily -6.395e-02 4.560e-02 -1.403 0.161192
## Sale.ConditionNormal 7.469e-02 2.294e-02 3.255 0.001187 **
## Sale.ConditionPartial 9.074e-02 3.026e-02 2.998 0.002807 **
## age -2.582e-03 4.994e-04 -5.170 3.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1431 on 718 degrees of freedom
## Multiple R-squared: 0.8965, Adjusted R-squared: 0.8875
## F-statistic: 100.3 on 62 and 718 DF, p-value: < 2.2e-16
It has the Adjusted R-squared value as .8875 indicating that 88.75 of the price houses are explained by the included variables.
Did you decide to transform any variables? Why or why not? Explain in a few sentences.
We have transformed some variables like area, price and Lot.Area for the reasons mentioned in section 2.1.4.1 and also the adjusted R squared value will be higher.
summary(lm(formula= log(price)~Lot.Area, data = ames_train))$adj.r.squared < summary(lm(formula= log(price)~log(Lot.Area), data = ames_train))$adj.r.squared
## [1] TRUE
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.
age section 2.1.1 of EDA, as it will be easier to do the comparison between any numberical variables. One such relation has been shown in that section.pairs(~Overall.Qual + Overall.Cond + BsmtFin.SF.1 + Garage.Cars + Bsmt.Qual + age + area,data=ames_train, main="Colinearity Check")
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.
We’ll use the variables which we have selected in the section 3.1. But when we do the model for good data (i.e. with normal sale condition), we can see the increase in the Adjusted R-squared from 0.8875 to 0.9087. This means that without the outliers, the model can very well explain the house prices
ames_train_no_outliers <- ames_train %>% filter(Pool.QC != "None" | tolower(Sale.Condition) == "normal")
no_outlier_final_model <- stepAIC(lm(formula = log(price) ~ log(area) + Overall.Qual + X1st.Flr.SF + Kitchen.AbvGr + Garage.Cars + Enclosed.Porch + Screen.Porch + MS.Zoning + Land.Slope + Neighborhood + Bldg.Type + Exterior.2nd + Bsmt.Qual + Central.Air + age, data = na.omit(ames_train_no_outliers)), direction = "backward", k=2, trace = 0, steps = 1000)
summary(no_outlier_final_model)
##
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + X1st.Flr.SF +
## Garage.Cars + Screen.Porch + MS.Zoning + Land.Slope + Neighborhood +
## Bldg.Type + Exterior.2nd + Bsmt.Qual + Central.Air + age,
## data = na.omit(ames_train_no_outliers))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56395 -0.05978 0.00161 0.06677 0.56112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.045e+00 1.844e-01 43.628 < 2e-16 ***
## log(area) 3.924e-01 2.255e-02 17.400 < 2e-16 ***
## Overall.Qual 6.180e-02 6.688e-03 9.241 < 2e-16 ***
## X1st.Flr.SF 1.386e-04 1.922e-05 7.213 1.71e-12 ***
## Garage.Cars 6.011e-02 1.062e-02 5.660 2.37e-08 ***
## Screen.Porch 2.391e-04 8.587e-05 2.784 0.005546 **
## MS.ZoningFV 3.675e-01 8.210e-02 4.476 9.15e-06 ***
## MS.ZoningRH 2.484e-01 9.038e-02 2.749 0.006168 **
## MS.ZoningRL 3.524e-01 6.795e-02 5.187 2.96e-07 ***
## MS.ZoningRM 2.468e-01 6.377e-02 3.870 0.000121 ***
## Land.SlopeMod 2.701e-02 2.569e-02 1.052 0.293438
## Land.SlopeSev 4.237e-01 1.246e-01 3.400 0.000720 ***
## NeighborhoodBlueste 1.666e-01 1.028e-01 1.620 0.105756
## NeighborhoodBrDale 1.178e-01 9.266e-02 1.272 0.203970
## NeighborhoodBrkSide 1.749e-01 8.286e-02 2.110 0.035252 *
## NeighborhoodClearCr 1.967e-01 9.092e-02 2.164 0.030893 *
## NeighborhoodCollgCr 7.891e-02 7.296e-02 1.082 0.279849
## NeighborhoodCrawfor 2.768e-01 8.053e-02 3.438 0.000629 ***
## NeighborhoodEdwards 6.477e-02 7.572e-02 0.855 0.392670
## NeighborhoodGilbert 6.624e-02 7.530e-02 0.880 0.379410
## NeighborhoodGreens 2.845e-01 9.679e-02 2.939 0.003420 **
## NeighborhoodIDOTRR 1.556e-01 8.831e-02 1.762 0.078524 .
## NeighborhoodMeadowV -3.253e-02 9.529e-02 -0.341 0.732979
## NeighborhoodMitchel 1.400e-01 7.549e-02 1.855 0.064158 .
## NeighborhoodNAmes 1.209e-01 7.517e-02 1.609 0.108146
## NeighborhoodNPkVill 1.300e-01 1.359e-01 0.957 0.339102
## NeighborhoodNWAmes 8.797e-02 7.697e-02 1.143 0.253531
## NeighborhoodNoRidge 1.998e-01 7.682e-02 2.601 0.009534 **
## NeighborhoodNridgHt 1.988e-01 7.215e-02 2.756 0.006031 **
## NeighborhoodOldTown 1.524e-01 8.417e-02 1.811 0.070645 .
## NeighborhoodSWISU 5.465e-02 8.720e-02 0.627 0.531128
## NeighborhoodSawyer 1.289e-01 7.681e-02 1.678 0.093857 .
## NeighborhoodSawyerW 7.075e-02 7.442e-02 0.951 0.342146
## NeighborhoodSomerst 1.333e-01 8.047e-02 1.656 0.098250 .
## NeighborhoodStoneBr 1.948e-01 7.829e-02 2.489 0.013099 *
## NeighborhoodTimber 1.189e-01 7.836e-02 1.517 0.129835
## NeighborhoodVeenker 2.335e-01 8.210e-02 2.845 0.004604 **
## Bldg.Type2fmCon 2.917e-03 3.570e-02 0.082 0.934904
## Bldg.TypeDuplex -1.302e-01 3.207e-02 -4.061 5.55e-05 ***
## Bldg.TypeTwnhs -1.492e-01 3.415e-02 -4.370 1.47e-05 ***
## Bldg.TypeTwnhsE -7.465e-02 2.465e-02 -3.028 0.002567 **
## Exterior.2ndBrk Cmn 4.004e-02 1.485e-01 0.270 0.787568
## Exterior.2ndBrkFace 1.346e-01 5.930e-02 2.270 0.023590 *
## Exterior.2ndCmentBd 1.808e-01 6.478e-02 2.791 0.005434 **
## Exterior.2ndHdBoard 1.001e-01 5.289e-02 1.893 0.058912 .
## Exterior.2ndImStucc 9.085e-02 6.949e-02 1.307 0.191596
## Exterior.2ndMetalSd 1.199e-01 5.132e-02 2.337 0.019802 *
## Exterior.2ndPlywood 4.993e-02 5.417e-02 0.922 0.357037
## Exterior.2ndStucco 2.039e-01 6.236e-02 3.269 0.001143 **
## Exterior.2ndVinylSd 1.121e-01 5.244e-02 2.137 0.032994 *
## Exterior.2ndWd Sdng 1.235e-01 5.213e-02 2.369 0.018138 *
## Exterior.2ndWd Shng 1.336e-01 5.659e-02 2.360 0.018609 *
## Bsmt.QualFa -1.690e-01 4.323e-02 -3.910 0.000103 ***
## Bsmt.QualGd -1.042e-01 2.386e-02 -4.365 1.50e-05 ***
## Bsmt.QualNone -2.872e-01 4.526e-02 -6.346 4.44e-10 ***
## Bsmt.QualPo 1.961e-01 1.356e-01 1.446 0.148628
## Bsmt.QualTA -1.406e-01 2.953e-02 -4.763 2.41e-06 ***
## Central.AirY 1.751e-01 2.528e-02 6.924 1.16e-11 ***
## age -2.544e-03 4.298e-04 -5.918 5.57e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.115 on 583 degrees of freedom
## Multiple R-squared: 0.917, Adjusted R-squared: 0.9087
## F-statistic: 111.1 on 58 and 583 DF, p-value: < 2.2e-16
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
final_predicted_values <- exp(predict(aic_final_model))
final_residues <- ames_train$price - final_predicted_values
## Warning in ames_train$price - final_predicted_values: longer object length
## is not a multiple of shorter object length
final_rmse <- sqrt(mean(final_residues^2))
final_rmse
## [1] 114854.5
While testing the model on the out-of-sample data, the RMSE value 114854.5 proved that the model predicts very poorly the house price. This could be due to
NAs associated to itLet’s run the same for the final model (with outliers)
ames_test_transformed <- ames_transform(ames_test)
ames_test_transformed$Sale.Condition <- ames_test$Sale.Condition
model_var_level <- names(aic_final_model$xlevels)
aic_final_model$xlevels <- sapply(model_var_level, function(x) union(aic_final_model$xlevels[[x]], levels(ames_test_transformed[[x]])))
predict_test_data <- exp(predict(aic_final_model, newdata=ames_test_transformed))
## Warning in predict.lm(aic_final_model, newdata = ames_test_transformed):
## prediction from a rank-deficient fit may be misleading
final_test_residue <- ames_test_transformed$price - predict_test_data
final_test_rmse <- sqrt(mean(final_test_residue^2, na.rm = TRUE))
print(final_test_rmse)
## [1] 44150.01
The value of 44150.01 proves the model is very much improved.
Right Skewness causes a nearly not-normal distribution with left half having less data. (check the below histogram)
histogram(final_test_residue)
This is explained by
plot(no_outlier_final_model)
## Warning: not plotting observations with leverage one:
## 78, 127, 241
## Warning: not plotting observations with leverage one:
## 78, 127, 241
For your final model, calculate and briefly comment on the RMSE.
When we try to find the RMSE for the no-outlier model, the value is decreased, indicating that we can use it for the prediction of non-outliers.
model_var_level <- names(no_outlier_final_model$xlevels)
no_outlier_final_model$xlevels <- sapply(model_var_level, function(x) union(no_outlier_final_model$xlevels[[x]], levels(ames_test_transformed[[x]])))
predict_test_new_data <- exp(predict(no_outlier_final_model, newdata=ames_test_transformed))
## Warning in predict.lm(no_outlier_final_model, newdata =
## ames_test_transformed): prediction from a rank-deficient fit may be
## misleading
final_test_residue_no_outlier <- ames_test_transformed$price - predict_test_new_data
final_test_rmse_no_outlier <- sqrt(mean(final_test_residue_no_outlier^2, na.rm = TRUE))
print(final_test_rmse_no_outlier)
## [1] 38312.02
The below plot shows that type 2 (of final_test_residue_no_outlier) is well around 0 compared to the type 1 (of final_residues). The residues are much less scatterd than the training data compared to the plot which we have seen in Section 2.5
This shows the good sign of the model
residual_comp1 <- data.frame(index<- seq(from=1, to=dim(ames_train)[1], by=1), residue <- final_residues, type <- rep(1,dim(ames_train)[1]))
residual_comp2 <- data.frame(index<- seq(from=1, to=dim(ames_test_transformed)[1], by=1), residue <- final_test_residue_no_outlier, type <- rep(2,dim(ames_test_transformed)[1]))
colnames(residual_comp1) <- c("index", "residue", "type")
colnames(residual_comp2) <- c("index", "residue", "type")
initial_residual_comparison <- rbind(residual_comp1, residual_comp2)
ggplot(initial_residual_comparison, aes(x=index, y=residue)) + geom_point(aes(color=type, alpha=0.5))
final_interval_prediction <- exp(predict(no_outlier_final_model, interval="prediction", newdata=ames_test_transformed))
## Warning in predict.lm(no_outlier_final_model, interval = "prediction",
## newdata = ames_test_transformed): prediction from a rank-deficient fit may
## be misleading
percentage_covered <- mean(ames_test_transformed$price > final_interval_prediction[,"lwr"] & ames_test_transformed$price < final_interval_prediction[,"upr"], na.rm = TRUE)
percentage_covered
## [1] 0.8286414
82.86% of the housing prices are well reasoned by the variables.
What are some strengths and weaknesses of your model?
17.14% explanation38312.02 is still high.Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.
You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention: * What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data? * What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?
load("ames_validation.Rdata")
ames_validation_transformed <- ames_transform(ames_validation)
ames_validation_transformed$Sale.Condition <- ames_validation$Sale.Condition
model_var_level <- names(no_outlier_final_model$xlevels)
no_outlier_final_model$xlevels <- sapply(model_var_level, function(x) union(no_outlier_final_model$xlevels[[x]], levels(ames_validation[[x]])))
predict_validation_data <- exp(predict(no_outlier_final_model, newdata=ames_validation_transformed))
## Warning in predict.lm(no_outlier_final_model, newdata =
## ames_validation_transformed): prediction from a rank-deficient fit may be
## misleading
validation_residue <- ames_validation$price - predict_validation_data
validation_rmse <- sqrt(mean(validation_residue^2, na.rm = TRUE))
validation_rmse
## [1] 51398.32
The RMSE of 51398.32 shows that the model can still be improved. This is due the outliers present as below
plot(ames_validation_transformed$price)
Let’s remove the most priced 20 data and try again
ames_validation_transformed2 <- ames_validation_transformed[-head(order(ames_validation_transformed$price, decreasing = T), n=20),]
model_var_level <- names(no_outlier_final_model$xlevels)
no_outlier_final_model$xlevels <- sapply(model_var_level, function(x) union(no_outlier_final_model$xlevels[[x]], levels(ames_validation_transformed2[[x]])))
predict_validation_data <- exp(predict(no_outlier_final_model, newdata=ames_validation_transformed2))
## Warning in predict.lm(no_outlier_final_model, newdata =
## ames_validation_transformed2): prediction from a rank-deficient fit may be
## misleading
validation_residue <- ames_validation_transformed2$price - predict_validation_data
validation_rmse <- sqrt(mean(validation_residue^2, na.rm = TRUE))
validation_rmse
## [1] 47149.05
Now the value is decreased to 47149.05 as a good sign.
Also check the residue comparison plot. It shows that the predicted residues are in the same disperse level as the training data.
residual_comp1 <- data.frame(index<- seq(from=1, to=dim(ames_train)[1], by=1), residue <- final_residues, type <- rep(1,dim(ames_train)[1]))
residual_comp2 <- data.frame(index<- seq(from=1, to=dim(ames_validation_transformed2)[1], by=1), residue <- validation_residue, type <- rep(2,dim(ames_validation_transformed2)[1]))
colnames(residual_comp1) <- c("index", "residue", "type")
colnames(residual_comp2) <- c("index", "residue", "type")
initial_residual_comparison <- rbind(residual_comp1, residual_comp2)
ggplot(initial_residual_comparison, aes(x=index, y=residue)) + geom_point(aes(color=type, alpha=0.5))
validation_interval_prediction <- exp(predict(no_outlier_final_model, interval="prediction", newdata=ames_validation_transformed2))
## Warning in predict.lm(no_outlier_final_model, interval = "prediction",
## newdata = ames_validation_transformed2): prediction from a rank-deficient
## fit may be misleading
percentage_covered <- mean(ames_validation_transformed2$price > validation_interval_prediction[,"lwr"] & ames_validation_transformed2$price < validation_interval_prediction[,"upr"], na.rm = TRUE)
percentage_covered
## [1] 0.7648721
76.48% of the housing prices are well reasoned by the variables.
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.
Overvalued vs Undervalues price vs area comparison. The values are determined from the IQR percentile of the house prices
undervalued = summary(ames_validation_transformed$price)[[2]]
undervalued_data <- ames_validation_transformed2[which(ames_validation_transformed$price <= undervalued),] %>% select(area,price,Neighborhood)
undervalued_data$type <- "undervalued"
colnames(undervalued_data) <- c("area", "price", "Neighborhood", "type")
overvalued = summary(ames_validation_transformed$price)[[5]]
overvalued_data <- ames_validation_transformed2[which(ames_validation_transformed$price >= overvalued),] %>% select(area,price,Neighborhood)
overvalued_data$type <- "overvalued"
colnames(overvalued_data) <- c("area", "price", "Neighborhood", "type")
misfit <- rbind(overvalued_data,undervalued_data)
suppressWarnings(ggplot(data=misfit, aes(x=log(area), y=log(price), color=factor(type), shape=factor(type))) + geom_point())
## Warning: Removed 8 rows containing missing values (geom_point).
Count of undervalued and overvalues home in Neighborhood
misfit %>% filter(!is.na(price)) %>% group_by(Neighborhood) %>% summarise(undervalued_count = sum(ifelse(type == "undervalued", 1, 0))) %>% arrange(desc(undervalued_count)) %>% slice(1:10)
## # A tibble: 10 x 2
## Neighborhood undervalued_count
## <fct> <dbl>
## 1 NAmes 38
## 2 OldTown 18
## 3 Sawyer 15
## 4 CollgCr 12
## 5 Edwards 12
## 6 Gilbert 12
## 7 BrkSide 9
## 8 Mitchel 9
## 9 IDOTRR 8
## 10 NWAmes 7
misfit %>% filter(!is.na(price)) %>% group_by(Neighborhood) %>% summarise(overvalued_count = sum(ifelse(type == "overvalued", 1, 0))) %>% arrange(desc(overvalued_count)) %>% slice(1:10)
## # A tibble: 10 x 2
## Neighborhood overvalued_count
## <fct> <dbl>
## 1 CollgCr 20
## 2 NAmes 20
## 3 OldTown 19
## 4 NWAmes 13
## 5 NridgHt 13
## 6 Gilbert 11
## 7 Sawyer 11
## 8 Edwards 10
## 9 Mitchel 10
## 10 BrkSide 9
bas.lm was not usable because of 2^n combinations and we have step in for finding the best model. Also we need to take care of Forwards/Backward selection of variables using adj-R-squared/BIC/AIC (whatever be the types) as it involves more variablesarea, Overall.Qual, MS.Zoning and agearea, price has to be applied to make sure that the model is good